Despite great effort spent measuring topological features of large networks like the Internet, it
was recently argued that sampling based on taking paths through the network (e.g., traceroutes)
introduces a fundamental bias in the observed degree distribution. We examine this bias analytically
and experimentally. For classic random graphs with mean degree $c$, we show analytically that
traceroute sampling gives an observed degree distribution $P(k) \sim k^{-1}$ for $k < c$, even though the
underlying degree distribution is Poisson. For graphs whose degree distributions have power-law tails
$P(k) \sim k^{-\alpha}$, the accuracy of traceroute sampling is highly sensitive to the population of low-degree
vertices. In particular, when the graph has a large excess (i.e., many more edges than vertices),
traceroute sampling can significantly misestimate $\alpha$.



The Internet is a canonical complex network, and a great deal of effort has been
spent measuring its topology. However, unlike the Web, where the outgoing links
are directly visible, we cannot typically ask a router who its neighbors are. As a
result, studies have sought to infer the topology of the Internet by aggregating
either paths through the network (i.e., traceroutes from a small number of sources
to a large number of destinations), routing decisions like those imbedded in BGP
routing tables [10, 11], or both. Although such methods are known to be noisy
[12, 13], they strongly suggest that the Internet has a power-law degree
distribution at both the router and domain levels.

However, Lakhina et al. [3] recently argued that traceroute-based sampling
introduces a fundamental bias in topological inferences, since the probability that
an edge appears within an efficient route decreases with the distance from the
source. They showed empirically that traceroutes from a single source cause
Erdős-Rényi random graphs $G(n,p)$, whose underlying distribution is Poisson [15],
to appear to have a power-law degree distribution $P(k) \sim k^{-1}$.

In this paper, we prove this result analytically by modeling the growth of a
spanning tree on $G(n,p)$ using differential equations. Certainly no one would
argue that the Internet is a purely random graph; indeed, the degree distributions
reported in measurement studies have $P(k) \sim k^{-\alpha}$ with $2 < \alpha < 3$.
However, it is evocative that traceroute sampling can create the appearance of a
power-law degree distribution where none in fact exists.

Even if the Internet has a power-law degree distribution, it is reasonable to ask
whether traceroute sampling gives an accurate estimate of the exponent $\alpha$ (a
question also raised in earlier work). Here, we demonstrate that power-law degree
distributions are only well sampled when the graph has a small excess, i.e., a mean
degree close to 2, so that the graph is very treelike. Other cases can result in a
significant over- or under-estimation of $\alpha$. Indeed, the accuracy of traceroute
sampling is highly sensitive to the low-degree part of the degree distribution, not
just the high-degree tail.



Traceroute spanning trees: analytical results. The set of traceroutes from a single
source can be modeled as a spanning tree [17]. If we assume that Internet routing
protocols approximate shortest paths, this spanning tree is built breadth-first
from the source. In fact, the results of this section apply to spanning trees built
in a variety of ways, as we will see below.

We can think of the spanning tree as built step-by-step 
by an algorithm that explores the graph. At each step, 
every vertex in the graph is labeled reached, pending, or 
unknown. Pending vertices are the leaves of the current 
tree; reached vertices are interior vertices; and unknown 
vertices are those not yet connected. We initialize the 
process by labeling the source vertex pending, and all 
other vertices unknown. Then the growth of the spanning 
tree is given by the following pseudocode: 

    while there are pending vertices:
        choose a pending vertex v
        label v reached
        for every unknown neighbor u of v:
            label u pending

The type of spanning tree is determined by how we choose the pending vertex $v$.
Storing vertices in a queue and taking them in FIFO (first-in, first-out) order
gives a breadth-first tree of shortest paths; if we like we can break ties randomly
between vertices of the same age in the queue, which is equivalent to adding a
small noise term to the length of each edge as in [14]. Storing pending vertices on
a stack and taking them in LIFO (last-in, first-out) order builds a depth-first
tree. Finally, choosing $v$ uniformly at random from the pending vertices gives a
"random-first" tree.

Surprisingly, while these three processes build different trees, and traverse them
in different orders, they all yield the same degree distribution when $n$ is large.
To illustrate this, Fig. 1 shows the degree distributions for each type of spanning
tree for a random graph $G(n, p = c/n)$ where $n = 10^5$ and $c = 100$. The three
degree distributions are indistinguishable, and all agree with the analytic results
derived below.

FIG. 1: Sampled degree distributions from breadth-first, depth-first and
random-first spanning trees on a random graph of size $n = 10^5$ and average degree
$c = 100$, and our analytic results (black dots). For comparison, the black line
shows the Poisson degree distribution of the underlying graph. Note the power-law
behavior of the apparent degree distribution $P(k) \sim k^{-1}$, which extends up
to a cutoff at $k \sim c$.
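The exploration loop and the three selection rules described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' code: the names `gnp_adjacency` and `spanning_tree_degrees`, the `order` argument, and the quadratic-time graph generator are all hypothetical choices made for the example, which is only practical for small $n$.

```python
import random
from collections import Counter, deque

def gnp_adjacency(n, p, rng):
    """Adjacency lists of an Erdos-Renyi graph G(n, p). O(n^2), so keep n small."""
    adj = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                adj[i].append(j)
                adj[j].append(i)
    return adj

def spanning_tree_degrees(adj, source, order, rng):
    """Grow a tree from `source`; return {vertex: tree degree} for reached vertices.

    order: "bfs" (FIFO), "dfs" (LIFO), or "random" (uniform among pending).
    """
    UNKNOWN, PENDING, REACHED = 0, 1, 2
    label = [UNKNOWN] * len(adj)
    label[source] = PENDING
    pending = deque([source])
    tree_degree = {}
    while pending:
        if order == "bfs":
            v = pending.popleft()
        elif order == "dfs":
            v = pending.pop()
        else:
            # random-first: move a uniformly chosen pending vertex to the end, pop it
            i = rng.randrange(len(pending))
            pending[i], pending[-1] = pending[-1], pending[i]
            v = pending.pop()
        label[v] = REACHED
        children = 0
        for u in adj[v]:
            if label[u] == UNKNOWN:
                label[u] = PENDING
                pending.append(u)
                children += 1
        # tree degree = children, plus one edge to the parent unless v is the root
        tree_degree[v] = children + (0 if v == source else 1)
    return tree_degree

if __name__ == "__main__":
    rng = random.Random(0)
    adj = gnp_adjacency(1000, 20 / 1000, rng)
    for order in ("bfs", "dfs", "random"):
        hist = Counter(spanning_tree_degrees(adj, 0, order, rng).values())
        print(order, sorted(hist.items())[:5])
```

All three orders reach the same set of vertices (the component of the source); only the order of exploration, and hence the tree, differs. Comparing the three histograms at larger $n$ is one way to probe the claim that the degree distributions coincide.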



We now show analytically that building spanning trees in Erdős-Rényi random graphs
$G(n, p = c/n)$ using any of the processes described above gives rise to an apparent
power-law degree distribution $P(k) \sim k^{-1}$ for $k < c$. We focus here on the
case where the average degree $c$ is large, but constant with respect to $n$; we
believe our results also hold if $c$ is a moderately growing function of $n$, such
as $\log n$ or $n^{\epsilon}$ for small $\epsilon$, but it seems more difficult to
make our analysis rigorous in that case.

To model the progress of the while loop described above, let $S(T)$ and $U(T)$
denote the number of pending and unknown vertices at step $T$, respectively. The
expected changes in these variables at each step are

$$\begin{aligned}
\mathrm{E}[U(T+1) - U(T)] &= -p\,U(T) \\
\mathrm{E}[S(T+1) - S(T)] &= p\,U(T) - 1
\end{aligned} \tag{1}$$

Here the $pU(T)$ terms come from the fact that a given unknown vertex $u$ is
connected to the chosen pending vertex $v$ with probability $p$, in which case we
change its label from unknown to pending; the $-1$ term comes from the fact that we
also change $v$'s label from pending to reached. Moreover, these equations apply no
matter how we choose $v$: whether $v$ is the "oldest" vertex (breadth-first), the
"youngest" one (depth-first), or a random one (random-first). Since edges in
$G(n,p)$ are independent, the events that $v$ is connected to each unknown vertex
$u$ are independent and occur with probability $p$.

Writing $t = T/n$, $s(t) = S(tn)/n$ and $u(t) = U(tn)/n$, the difference equations
(1) become the following system of differential equations,

$$\frac{du}{dt} = -cu, \qquad \frac{ds}{dt} = cu - 1. \tag{2}$$



With the initial conditions $u(0) = 1$ and $s(0) = 0$, the solution to (2) is

$$u(t) = e^{-ct}, \qquad s(t) = 1 - t - e^{-ct}. \tag{3}$$
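The closed form $u(t) = e^{-ct}$, $s(t) = 1 - t - e^{-ct}$ can be compared with a direct simulation of the two counts. The sketch below never builds the graph: it assumes that each unknown vertex is adjacent to the chosen pending vertex independently with probability $p = c/n$, which is the same independence used in the text. The function names and parameter values are illustrative assumptions.

```python
import math
import random

def simulate_counts(n, c, seed=0):
    """Track (S, U) = (#pending, #unknown) step by step without storing the graph."""
    rng = random.Random(seed)
    p = c / n
    S, U = 1, n - 1                 # the source starts pending, all others unknown
    S_hist, U_hist = [S], [U]
    while S > 0:
        # Each unknown vertex neighbors the chosen pending vertex with probability p.
        newly_pending = sum(1 for _ in range(U) if rng.random() < p)
        U -= newly_pending
        S += newly_pending - 1      # -1: the chosen vertex becomes reached
        S_hist.append(S)
        U_hist.append(U)
    return S_hist, U_hist

def max_deviation(n, c, seed=0):
    """Largest gap between U(T)/n and exp(-c T / n) over one run."""
    _, U_hist = simulate_counts(n, c, seed)
    return max(abs(U / n - math.exp(-c * T / n)) for T, U in enumerate(U_hist))

if __name__ == "__main__":
    print(max_deviation(2000, 5.0, seed=1))
```

Because exactly one vertex becomes reached per step, $S(T) + U(T) + T = n$ holds at every step of the simulation, which mirrors $s(t) = 1 - t - u(t)$ in the continuous solution.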



The algorithm ends at the smallest positive root $t_f$ of $s(t) = 0$; using
Lambert's function $W$, defined as $W(x) = y$ where $y e^{y} = x$, we can write

$$t_f = 1 + \frac{1}{c} W(-c e^{-c}). \tag{4}$$

Note that $t_f$ is the fraction of vertices which are reached at the end of the
process, and this is simply the size of the giant component of $G(n, c/n)$.
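The root $t_f$ can also be found numerically without a Lambert $W$ routine. The sketch below is one possible approach, assuming $c > 1$: it iterates $t \leftarrow 1 - e^{-ct}$ from $t = 1$, which is the condition $s(t) = 0$ rewritten as a fixed point. The name `giant_fraction` and the tolerance are assumptions for the example.

```python
import math

def giant_fraction(c, tol=1e-12, max_iter=10_000):
    """Positive root t_f of 1 - t - exp(-c t) = 0 for c > 1, by fixed-point iteration."""
    t = 1.0
    for _ in range(max_iter):
        t_next = 1.0 - math.exp(-c * t)
        if abs(t_next - t) < tol:
            return t_next
        t = t_next
    return t

if __name__ == "__main__":
    for c in (1.5, 2.0, 5.0, 100.0):
        t_f = giant_fraction(c)
        w = c * (t_f - 1.0)          # rearranging (4): W(-c e^{-c}) = c (t_f - 1)
        print(c, t_f, w * math.exp(w), -c * math.exp(-c))
```

The last two printed columns should agree, since $w = c(t_f - 1)$ must satisfy $w e^{w} = -c e^{-c}$. For large $c$ the root is numerically indistinguishable from 1, consistent with the giant component covering almost the whole graph.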

Now, we wish to calculate the degree distribution $P(k)$ of this tree. The degree
of each vertex $v$ is the number of its previously unknown neighbors, plus one for
the edge by which it became attached (except for the root). Now, if $v$ is chosen
at time $t$, in the limit $n \to \infty$ the probability it has $k$ unknown
neighbors is given by the Poisson distribution with mean $m = cu(t)$,
$\mathrm{Poisson}(m, k) = e^{-m} m^{k} / k!$. Averaging over all the vertices in
the tree gives

$$P(k+1) = \frac{1}{t_f} \int_0^{t_f} dt\; \mathrm{Poisson}(cu(t), k).$$

It is helpful to change the variable of integration to $m$. Since $m = c e^{-ct}$
we have $dm = -cm\,dt$, and

$$\begin{aligned}
P(k+1) &= \frac{1}{c\,t_f} \int_{c(1-t_f)}^{c} dm\; \frac{\mathrm{Poisson}(m, k)}{m} \\
&\approx \frac{1}{c\,k!} \int_{c e^{-c}}^{c} dm\; e^{-m} m^{k-1}.
\end{aligned} \tag{5}$$

Here in the second line we use the fact that $t_f \approx 1 - e^{-c}$ when $c$ is
large (i.e., the giant component encompasses almost all of the graph).

The integral in (5) is given by the difference between two incomplete Gamma
functions. However, since the integrand is peaked at $m = k - 1$ and falls off
exponentially for larger $m$, for $k < c$ it coincides almost exactly with the full
Gamma function $\Gamma(k)$. Specifically, for any $c > 0$ we have

$$\int_0^{c e^{-c}} dm\; e^{-m} m^{k-1} < c e^{-c},$$

and, if $k - 1 = c(1 - \epsilon)$ for $\epsilon > 0$, then the remaining upper tail
$\int_c^{\infty} dm\, e^{-m} m^{k-1}$ is $o(\Gamma(k))$ if $\epsilon > 1/\sqrt{k}$,
i.e., if $k < c - c^{\alpha}$ for some $\alpha > 1/2$. In that case we have

$$P(k+1) = (1 - o(1))\, \frac{\Gamma(k)}{c\,k!} = (1 - o(1))\, \frac{1}{ck},$$

giving a power law $k^{-1}$ up to $k \sim c$.

Although we omit some technical details, this derivation can be made mathematically
rigorous using results of Wormald [18], who showed that under fairly generic
conditions, the state of discrete stochastic processes like this one is
well-modeled by the corresponding rescaled differential equations. Specifically, it
can be shown that if we condition on the initial source vertex being in the giant
component, then with high probability, for all $t$ such that $0 < t < t_f$,
$U(tn) = u(t)n + o(n)$ and $S(tn) = s(t)n + o(n)$. It follows that with high
probability our calculations give the correct degree distribution of the spanning
tree within $o(1)$.
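The $1/(ck)$ law can be checked by evaluating the averaged Poisson integral numerically. The sketch below assumes the form $P(k+1) = \frac{1}{c\,t_f\,k!} \int_{c e^{-c t_f}}^{c} dm\, e^{-m} m^{k-1}$ obtained from the change of variable $m = c e^{-ct}$, and uses a plain Simpson rule with the integrand computed in log space. The function names, the step count, and the restriction to $k \ge 1$ and to moderate $c$ (so that $e^{-c}$ does not underflow to zero) are assumptions of this example.

```python
import math

def giant_fraction(c, tol=1e-12, max_iter=10_000):
    """Positive root of 1 - t - exp(-c t) = 0 for c > 1 (fixed-point iteration)."""
    t = 1.0
    for _ in range(max_iter):
        t_next = 1.0 - math.exp(-c * t)
        if abs(t_next - t) < tol:
            return t_next
        t = t_next
    return t

def prob_tree_degree(k, c, steps=20_000):
    """P(k+1): probability that a tree vertex has k >= 1 children (steps must be even)."""
    t_f = giant_fraction(c)
    lo = c * math.exp(-c * t_f)      # value of m at t = t_f
    hi = c                           # value of m at t = 0
    log_norm = math.log(c * t_f) + math.lgamma(k + 1)   # log(c * t_f * k!)

    def f(m):
        return math.exp(-m + (k - 1) * math.log(m) - log_norm)

    h = (hi - lo) / steps
    total = f(lo) + f(hi)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f(lo + i * h)
    return total * h / 3

if __name__ == "__main__":
    c = 100.0
    for k in (1, 10, 50, 90, 99):
        # a ratio near 1 means P(k+1) is close to 1/(c k)
        print(k, prob_tree_degree(k, c) * c * k)
```

For $c = 100$ the printed ratio should stay close to 1 for $k$ well below $c$ and drop as $k$ approaches $c$, which is the cutoff described in the text.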

Power-law degree distributions. We now turn to the case where the Internet does
have a power-law degree distribution $P(k) \sim k^{-\alpha}$, and demonstrate that
unless the excess, i.e., the number of edges minus the number of vertices, is
small, traceroute sampling can significantly misestimate $\alpha$.

There are several methods of constructing random graphs with power-law degree
distributions and we use two to support our claim: the configuration model [19], in
which the graph is random but conditioned on its degree distribution, and
preferential attachment [20], in which the graph is grown by a dynamical process
and has a degree distribution with a power-law tail.

In the configuration model, we examined graphs where
$P(k) = k^{-\alpha}/\zeta(\alpha)$ for all $k \ge 1$, with $\alpha$ ranging from
1.5 to 3. Since these graphs are not necessarily connected, we compare the sampled
and underlying degree distributions of the giant component (the latter has a
power-law tail with the same exponent as the entire graph). Fig. 2 shows that as
$\alpha$ increases, the observed distribution gets closer to the underlying
distribution. This is because, for instance, when $\alpha = 3$ the ratio of edges
to vertices in the giant component is only 1.02, so its excess is only 0.02 per
vertex. Thus any spanning tree on the giant component will include almost all of
its edges, and sample its degree distribution fairly well.
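The edge-to-vertex ratio of the giant component can be estimated with a small configuration-model experiment. The sketch below is an illustration under assumptions, not the authors' procedure: degrees are drawn from a zeta-like distribution truncated at a hypothetical `k_max`, stubs are paired uniformly at random, and self-loops and multi-edges are kept, so the result is a multigraph. All function names are made up for the example.

```python
import random
from collections import deque

def sample_power_law_degrees(n, alpha, k_max, rng):
    """Draw n degrees with P(k) proportional to k^(-alpha) on 1..k_max."""
    weights = [k ** (-alpha) for k in range(1, k_max + 1)]
    degrees = rng.choices(range(1, k_max + 1), weights=weights, k=n)
    if sum(degrees) % 2:             # stub pairing needs an even number of stubs
        degrees[0] += 1
    return degrees

def configuration_model(degrees, rng):
    """Pair stubs uniformly at random; returns adjacency lists of a multigraph."""
    stubs = [v for v, d in enumerate(degrees) for _ in range(d)]
    rng.shuffle(stubs)
    adj = [[] for _ in degrees]
    for a, b in zip(stubs[::2], stubs[1::2]):
        adj[a].append(b)
        adj[b].append(a)
    return adj

def largest_component(adj):
    """Return (vertices, edges) of the largest connected component."""
    seen = [False] * len(adj)
    best = (0, 0)
    for s in range(len(adj)):
        if seen[s]:
            continue
        seen[s] = True
        queue, verts, deg_sum = deque([s]), 0, 0
        while queue:
            v = queue.popleft()
            verts += 1
            deg_sum += len(adj[v])
            for u in adj[v]:
                if not seen[u]:
                    seen[u] = True
                    queue.append(u)
        best = max(best, (verts, deg_sum // 2))   # each edge is counted twice
    return best

if __name__ == "__main__":
    rng = random.Random(0)
    for alpha in (2.0, 2.5, 3.0):
        degrees = sample_power_law_degrees(50_000, alpha, 1000, rng)
        verts, edges = largest_component(configuration_model(degrees, rng))
        print(alpha, verts, edges / verts)
```

A ratio close to 1 means the giant component is nearly a tree, so a spanning tree misses few of its edges. The truncation at `k_max` and the finite size used here mean the printed ratios need not match the value quoted in the text.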

However, the size and excess of the giant component are highly sensitive to the
low-degree part of the degree distribution, not just its power-law tail. To
illustrate this, Fig. 3 shows graphs grown using the preferential attachment model
of [20]. Here every vertex has degree at least $m$, since it is given $m$ edges at
birth. As $m$ increases, the slope of the observed distribution on a log-log plot
becomes more shallow, giving a significant underestimate of $\alpha$; for instance,
for $m = 4$ we observe a slope of 2.7 rather than the correct value $\alpha = 3$.
Using the configuration model to construct random graphs with a minimum degree $m$
and a degree distribution with a power-law tail yields similar results.

FIG. 2: Comparison of underlying and observed degree distributions in the
configuration model with $n = 5 \times 10^5$ and various $\alpha$.

FIG. 3: Displacement of the power law tail for preferential attachment networks
with $n = 5 \times 10^5$. Traceroute sampling significantly underestimates the
slope $\alpha$ as $m$ increases.



This underestimation of $\alpha$ occurs because traceroutes sample high-degree
vertices more accurately than lower-degree ones: high-degree vertices are
encountered early on in the breadth-first tree, when most of their neighbors are
still unknown, while lower-degree vertices are encountered later, by which time
most of their neighbors are already reached. Thus the "visibility" of a vertex's
edges increases with its degree, making the slope of the observed distribution
less negative.
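One way to probe this visibility effect is to grow a preferential-attachment graph, build a breadth-first tree from a single source, and compare each vertex's tree degree with its true degree. The sketch below is a rough illustration: the generator is a simple variant that starts from a clique on $m + 1$ vertices and attaches each new vertex to $m$ distinct targets chosen in proportion to degree, which is an assumption and may differ in detail from the model of [20]. The function names are hypothetical.

```python
import random
from collections import deque

def preferential_attachment(n, m, rng):
    """Grow a graph in which each new vertex links to m distinct degree-biased targets."""
    adj = [set() for _ in range(n)]
    targets = []                     # each vertex appears once per unit of degree
    for i in range(m + 1):           # seed: clique on m + 1 vertices
        for j in range(i + 1, m + 1):
            adj[i].add(j)
            adj[j].add(i)
            targets += [i, j]
    for v in range(m + 1, n):
        chosen = set()
        while len(chosen) < m:
            chosen.add(rng.choice(targets))
        for u in chosen:
            adj[v].add(u)
            adj[u].add(v)
            targets += [u, v]
    return adj

def bfs_tree_degrees(adj, source):
    """Degree of every vertex in the breadth-first tree rooted at `source`."""
    seen = [False] * len(adj)
    seen[source] = True
    observed = [0] * len(adj)
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for u in adj[v]:
            if not seen[u]:
                seen[u] = True
                observed[u] += 1     # tree edge (v, u) counts for both endpoints
                observed[v] += 1
                queue.append(u)
    return observed

def mean_visibility(adj, observed, vertices):
    """Average fraction of a vertex's true edges that appear in the tree."""
    return sum(observed[v] / len(adj[v]) for v in vertices) / len(vertices)

if __name__ == "__main__":
    rng = random.Random(0)
    m = 4
    adj = preferential_attachment(20_000, m, rng)
    observed = bfs_tree_degrees(adj, 0)
    by_degree = sorted(range(len(adj)), key=lambda v: len(adj[v]), reverse=True)
    hubs = by_degree[: len(adj) // 100]
    low = [v for v in range(len(adj)) if len(adj[v]) == m]
    print(mean_visibility(adj, observed, hubs), mean_visibility(adj, observed, low))
```

The two printed numbers are the mean visibility of the top 1% of vertices by degree and of the minimum-degree vertices. The argument in the text suggests the first should be larger, though a single run from one source is only indicative.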

Traceroute sampling encounters another kind of problem at smaller values of
$\alpha$, namely significant finite-size effects, as Fig. 2 shows. The observed
value of $\alpha$ is roughly correct up to a "knee," above which the degree
distribution falls off more sharply. For $\alpha = 1.5$, for instance, Fig. 4 shows
that this knee occurs at a degree $k \sim n^{0.5}$. In these cases, a linear fit to
the observed degree distribution will considerably overestimate $\alpha$ unless we
omit the data above the knee.

FIG. 4: Finite size effects for traceroute sampling with $\alpha = 1.5$, with data
collapse for various $n$. Above a "knee" at $n^{0.5}$ the observed degree
distribution falls off more sharply.



Conclusions. There are two properties of the Internet which make it difficult to
map: unlike the World Wide Web where links are visible, the Internet's topology
must be queried indirectly, e.g., by traceroutes; and, since efficient routing
protocols cause these traceroutes to approximate shortest paths, edges far from the
source are difficult to see. It was observed by Lakhina et al. that these effects
can significantly bias the observed degree distribution, and even create the
appearance of a power law where none exists. We have proved this result
analytically for random graphs $G(n,p)$, which yield an observed distribution
$P(k) \sim k^{-1}$ for $k$ up to the average degree. Other mechanisms by which
power laws can appear in $G(n,p)$ include gradient-based flows [21], probabilistic
pruning [22], and minimum spanning trees on weighted random graphs [23].

While it seems likely that the Internet does have a power-law distribution, we have
shown that traceroute sampling can significantly misestimate the scaling exponent
$\alpha$. Thus we suggest that the published values of $\alpha$ may not accurately
reflect the real scaling of the Internet's topology. This poses an interesting
inverse problem: namely, given the value of $\alpha$ observed in traceroutes, what
is the most likely value of $\alpha$ in the underlying graph? Also, since
traceroutes from a single source, or a small number of sources (briefly explored in
earlier work), are inherently biased, how many sources are needed, as a function of
network size and topology, to accurately sample the network?